Delirium prediction in the ICU: designing a screening tool for preventive interventions

Abstract Introduction Delirium occurrence is common and preventive strategies are resource intensive. Screening tools can prioritize patients at risk. Using machine learning, we can capture time and treatment effects that pose a challenge to delirium prediction. We aim to develop a delirium prediction model that can be used as a screening tool. Methods From the eICU Collaborative Research Database (eICU-CRD) and the Medical Information Mart for Intensive Care version III (MIMIC-III) database, patients with one or more Confusion Assessment Method-Intensive Care Unit (CAM-ICU) values and intensive care unit (ICU) length of stay greater than 24 h were included in our study. We validated our model using 21 quantitative clinical parameters and assessed performance across a range of observation and prediction windows, using different thresholds and applied interpretation techniques. We evaluate our models based on stratified repeated cross-validation using 3 algorithms, namely Logistic Regression, Random Forest, and Bidirectional Long Short-Term Memory (BiLSTM). BiLSTM represents an evolution from recurrent neural network-based Long Short-Term Memory, and with a backward input, preserves information from both past and future. Model performance is measured using Area Under Receiver Operating Characteristic, Area Under Precision Recall Curve, Recall, Precision (Positive Predictive Value), and Negative Predictive Value metrics. Results We evaluated our results on 16 546 patients (47% female) and 6294 patients (44% female) from eICU-CRD and MIMIC-III databases, respectively. Performance was best in BiLSTM models where, precision and recall changed from 37.52% (95% confidence interval [CI], 36.00%–39.05%) to 17.45 (95% CI, 15.83%–19.08%) and 86.1% (95% CI, 82.49%–89.71%) to 75.58% (95% CI, 68.33%–82.83%), respectively as prediction window increased from 12 to 96 h. After optimizing for higher recall, precision and recall changed from 26.96% (95% CI, 24.99%–28.94%) to 11.34% (95% CI, 10.71%–11.98%) and 93.73% (95% CI, 93.1%–94.37%) to 92.57% (95% CI, 88.19%–96.95%), respectively. Comparable results were obtained in the MIMIC-III cohort. Conclusions Our model performed comparably to contemporary models using fewer variables. Using techniques like sliding windows, modification of threshold to augment recall and feature ranking for interpretability, we addressed shortcomings of current models.


INTRODUCTION
The diagnosis of delirium is common in critically ill patients and depending on the patient population its incidence can be up to 80%. 1 Typically delirium rates have ranged between 10% and 23%, and half of them acquire delirium in the intensive care unit (ICU). 2 Delirium leads to increased hospital length of stay and need for prolonged institutionalization for critically ill patients. [3][4][5] Delirium drives up healthcare costs, and its impact often persists beyond the ICU including risk for functional decline in daily living activities, and long-term cognitive impairment. [6][7][8][9][10] Treatment and prevention of delirium is dependent on identifying the complex interplay of multiple triggers in the ICU. 11 A multimodal strategy of evidence-based best-practice recommendations aimed at coordinating multidisciplinary care to reduce delirium risk and expedite ICU discharge commonly referred to as the ABCDEF bundle is effective in both preventing and treating delirium. 12,13 This bundle outlines in detail how we assess, prevent and manage pain, perform both spontaneous awakening and breathing trials daily in intubated and mechanically ventilated patients, choose analgesic and sedative agents, assess, prevent and manage delirium, incorporate early mobility, and engage family members in the care of these critically ill patients. Unfortunately, this bundle of interventions requires education of caregivers, coordination between a multidisciplinary team, is labor and resource intensive, and therefore not consistently implemented across all ICU patients and all health care settings 12,14 A screening tool to prioritize ABCDEF implementation to those who are most vulnerable can be an invaluable tool to maximize the benefit of the resource-intensive preventive measures.
Current assessment tools, such as the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU), only diagnoses delirium after its onset. 15 Administering CAM-ICU requires specialized training. Although each hospital has its own protocol for delirium, because of its time-consuming nature CAM-ICU is infrequently done compared to other vital signs and diagnosis can be delayed. Although certain patient characteristics, such as age, illness severity, and certain medications, are considered high risk for development of delirium or while elevations in inflammatory biomarkers possibly associated with severe disease, these risk factors have been inconsistent in their ability to predict the onset of delirium. [16][17][18] Previous prediction models trained on small patient cohorts lacked adequate power to capture the complex relationships between delirium and the time-varying predictor variables. 19,20 In attempting to improve the performance, larger administrative datasets were used to develop prediction models, using several hundred variables. However, these models lack interpretability, and are almost impossible to adopt in day-to-day practice. 21 Additionally, most of these models are not specific to the critically ill population and cannot be extrapolated to the ICU. 20 We propose to build a screening tool for delirium by developing and fine tuning a delirium prediction model that requires fewer variables than existing models, using large development and validation cohorts in comparison to the existing literature 18 and that can predict the risk of delirium in a continuous fashion using a sliding window. Using both conventional machine learning methods and deep learning algorithms, we will evaluate performance of our model across various observation and prediction windows to address the issues of variability across time and treatment effects. In addition, we will rank the independent variables in order of their predictive importance to help with interpretability. These attributes should help pave the way for implementation of a screening tool to help caregivers at the bedside.

Ethical review
The analysis using the eICU Collaborative Research Database (eICU-CRD) is exempt from institutional review board approval due to the retrospective design, lack of direct patient intervention, and the security schema, for which the re-identification risk was certified as meeting safe harbor standards by an independent privacy expert (Privacert, Cambridge, MA, USA) (Health Insurance Portability and Accountability Act Certification no. 1031219-2). The data in the Medical Information Mart for Intensive Care version III (MIMIC-III) are de-identified, and the institutional review boards of the Massachusetts Institute of Technology

Study population
The eICU-CRD is a freely available multicenter database comprising 200 859 patient unit encounters for 139 367 unique patients admitted between 2014 and 2015 in over 200 hospitals located throughout the United States. 22 The MIMIC-III database is an open-access single-center ICU database including 53 423 distinct hospital admissions for 46 476 unique patients admitted from 2001 to 2012. 23 Both datasets comprise data on patient demographics, vitals, clinical flowsheets, laboratory values, medications, interventions, and outcomes. We included all adult patients up to age of 89, with an ICU length of stay of at least 24 h and having had at least one CAM-ICU assessment, as shown in the cohort selection diagram (Supplementary eFigure 1). Patients have been admitted for multiple reasons to these ICUs. Their admission diagnoses have not been a factor in the selection of our patients.

Delirium assessment
Observation window refers to the period where patient data are collected, and the model is derived. Prediction window refers to the period from when the observation window ends up to the onset of the outcome, CAM-ICU positive in our case. We observed patients from 0 to 12 h, 0 to 24 h and 0 to 48 h. We predicted the incidence of Delirium for the next 12, 24, 48, 72, and 96 h. The diagnosis of delirium was made when at least one CAM-ICU value was positive. 15 In instances with multiple CAM-ICU assessments, onset of delirium was determined from the time of the first positive CAM-ICU.

Variable selection
The rationale for selection of independent variables was based on their ability to predict delirium in prior literature, availability in our databases, ease of extracting and monitoring in a real-time environment. We identified 21 categorical or numerical variables classified into demographic data, vital signs, laboratory values, and vasopressor dose that fulfilled above criteria. [24][25][26][27][28][29][30][31][32][33][34][35] We also calculated daily sequential organ failure assessment (SOFA) scores to provide overall patient status. Since admission diagnoses or past medical history were not consistently available in the applied datasets, we excluded them. Downstream variables such as outcomes would not be available in real-time and similarly excluded. Initiation of delirium therapies like antipsychotic drugs could be a reaction to onset of delirium, and hence excluded to avoid confounding. Table 1 lists all the variables used.

Data preprocessing
All variables were aggregated into hourly intervals, where the last known value was used as a candidate for that interval. In cases where the last value for each variable is not measured in the interval, the representative of that interval was computed by averaging the available measurements in the interval. Missing values that were collected hourly like vital signs were imputed by forward and then if needed backward imputation. Categorical variables were converted into a vector to capture the semantics of each category at the model derivation phase. Specifically, we reshaped the data that was fed in 3 dimensions for BILSTM to 2-dimension input for Logistic Regression (LR) and Random Forest (RF) to have the same input for each model and to provide a fair comparison for each model. For all continuous variables, we utilized the recorded value in the database without any adaptation. Heat map further details the set of variables, including linear correlations between each variable (Supplementary eFigure 2).

Model derivation and validation
We evaluated the results based on 5-fold stratified cross-validation. This method partitions the data into 5 equal segments. Respectively training and validation phases are done in 5 iterations in a manner that within each iteration a different segment of the data is held out for validation, while the remaining 4 segments are considered as a derivation set to train the model. Typically, metrics calculated based on the k-fold stratified cross-validation can effectively assess overfitting and has lower variance. 36 We used 3 sets of algorithms to evaluate delirium prediction, namely LR, RF, and Bidirectional Long Short-Term Memory (BiLSTM). BiLSTM represents an evolution from recurrent neural network-based Long Short-Term Memory (LSTM), and with a backward input, preserves information from both past and future, producing more accurate predictions. 37,38 Considering that both LR and RF are unable to process time series variables efficiently, we pre-processed the clinical variables and all-time steps and corresponding variables were flattened into a single record. This was done to ensure that both LR and RF have access to the same data about the changes in patient state as BiLSTM, to ensure a fair performance comparison.

Statistical analysis
The classification results for delirium prediction are reported using the Area Under Receiver Operating Characteristic (AUROC), Area Under Precision Recall Curve (AUPRC), Recall, Precision (Positive Predictive Value), and Negative Predictive Value. Furthermore, we also investigated calibration quality of our models.

Model interpretability
Although there are many definitions of interpretability, we focused on how the model ranks each input variable with respect to outcome prediction. LR and RF have been successfully employed in the clini- cal domain due to their ease of interpretation; however, they require additional processing to handle high-dimensional, longitudinal, and irregular electronic health record datasets. 39 In this context, we employed the Shapley Value Sampling (SVS) method to probe the Bi-LSTM model. 40 SVS is a perturbation-based method to compute variable attribution, which is based on sampling theory that can be used to estimate Shapley values. 41 The SVS produces feature ranking with respect to each feature input, allowing us to rank these variables based on their predictive power. Given that interpretability of neural networks is still an open research question, especially for temporal neural networks, we also provide results from 2 other methods, namely Integrated Gradient and Guided Backpropagation, to ensure that the variable importance results are consistent across the 3 methods. [42][43][44] Source code The entire code is available at https://github.com/mostafaalishahi/ Delirium_prediction_models.

Patient characteristics
The  As we were interested in making our model more sensitive for screening, we changed thresholds to have higher recall at the expense of precision, as such we assigned higher weights to the minority class (delirium positive). 45 Figure 5 ranks the features that have contributed to delirium prediction according to their relative importance in the eICU-CRD derived model. Ventilation, heart rate, age, white blood cell count, SOFA score, and vasopressor use are the highest ranked features across different prediction windows. Most of these features are also the highest ranked features when assessing interpretability in the MIMIC III cohort (Figure 6).

DISCUSSION
Our study shows that a machine learning model using only a few routine clinical variables replicated the performance of previously reported models that were developed using hundreds of variables. Our study successfully demonstrated that we could modify the performance of a model to fit our clinical needs as an effective screening tool. We took the following steps that helped us achieve our goal: (1) we studied the peak delirium onset time in our population and optimized the model to maximize predictive accuracy in that time frame; (2) we incorporated sliding windows in our model for continuous prediction across time and address drop in performance associated with predictions further ahead; and (3) we adjusted our thresholds to favor a high recall to ensure the model detects all patients at risk of delirium. Furthermore, we demonstrated that performance across different datasets diminish in accuracy and needs to be individualized to the population. Our features when ranked suggest older and more critically ill patients are at greater risk of delirium, especially in combination with mechanical ventilation and vasopressor therapy. Our model's ranking of features is consistent with what we already know as high-risk features. We have shared our code for replicating the results and recommend adjustments be made according to the specific setting and their needs. 46 Screening tools like CAM-ICU describe a snapshot in time and do not give an idea of the patient's progress nor are predictive. Strategies based on established best practices such as ABCDEF are resource intensive and challenging to implement universally. 12 Despite effective prevention strategies, delirium is still commonplace in the ICU highlighting a need for a screening tool that prioritizes patients at risk and allows us to exclude patients who are low risk from these timeconsuming therapies. Few models exist that can both accurately predict and be easy to implement. Most models use several hundred variables or use only a snapshot of features that can vary with time. Also, many of these models were trained on small datasets and use inconsistent approaches for collecting and/or stratifying data into training and validation cohorts limiting generalizability. [19][20][21] The Pre-Deliric and e-Pre-Deliric, were built with a handful of predictor variables from a large patient cohort, and been externally validated. 47,48 However, they employ data from admission variables that change and lose predictive power with time. Recent machine learning-based algorithms were able to predict delirium accurately but using over 700 predictor variables and were also criticized for their analysis methods. 21,49 Importantly, these models have not investigated how their performance changes with different prediction windows, optimal time of observation, capture the evolution of a patient's state through time and unable to adjust delirium risk estimation temporally. In our knowledge, our report is one of the first instances of delirium prediction, where we have not only tried to predict accurately across different scenarios but also addressed the issues with prior prediction models. Notably, we have iteratively developed our model to address the challenges that are posed by low incidence of delirium, temporal progression of disease, and different patient populations. Additionally, we have ventured into the realm of explaining how our features contribute, something that is rare in models using deep learning.
The BiLSTM-based model, which has the advantage of capturing temporal dependencies, performed the best of the 3 models evaluated, suggesting that the trajectory of predictive features is more informative than a single value. The simpler LR model is an attractive option if implementation is determined by computational limitations of a deep learning model. A longer observation window gained little in terms of model performance. A 48-h observation window even led to a drop in accuracy, but this is due to a decrease in the size of the training cohort. Another possibility is that factors contributing to delirium are proximal to its onset, further justifying the use of continuous prediction using a sliding window. The decay in performance of the algorithm as it predicts delirium with longer lead time is similar in both MIMIC-III and eICU-CRD. A screening tool needs to be sensitive. This is best addressed by a model with a high recall. We adjusted thresholds favoring a high recall while sacrificing precision (Supplementary eTables 2 and 3) to achieve this purpose. Also, it is desirable to have prediction algorithms that have short observation duration and predict the furthest ahead. In our case, since most (59% in eICU-CRD, 66% in MIMIC-III) delirium cases occurred within 48 h of ICU admission (Supplementary eFigure 3), hence we targeted performance for a 48-h prediction window with a 12-or 24-h observation window. We also demonstrated that as the prediction window moved beyond 48 h the model-maintained recall, but with a precipitous drop in precision. Non-trivial tuning of hyperparameters is required when algorithms are ported across populations. We suggest the performance of different observation and prediction times be studied on the local patient population and depending on the objective of the algorithm the optimal windows are determined. Furthermore, adapting the model to the local patient population could not only improve the predictive performance, but also calibration quality (Supplementary eFigure 4).
Delirium is precipitated through many factors, some that are unique to the ICU. Our variables were chosen a priori based on literature review. We only included variables that can be easily extracted in real time. Instead of using static values, we employed a sliding window for prediction and incorporated the trajectory of each variable over time. Our results indicate that this strategy predicts delirium more accurately than values captured at a moment in time and eliminates the need for long-term prediction.
Since we conducted a retrospective study, causality between the features and delirium cannot be established. Other limitations include selection bias (we excluded observations with missing CAM-ICU values) and interpreter bias (the data recorded in the databases might have been collected after the onset of delirium, given the discontinuous nature of CAM-ICU measurement). Additionally, CAM-ICU was scored by different nurses at separate times and in different units, potentially resulting in inter-operator variability. This study also does not have the ability to predict the duration or outcomes of each patient once delirium has occurred.

CONCLUSION
We successfully designed a delirium prediction model as a potential screening tool for ABCDEF bundle implementation. Using a few clinically relevant predictor variables we were able to achieve comparable performance to contemporary and well-reported models.
We were able to tackle the challenge presented by evolving temporal and treatment effects by using methods that captured temporal trends in data rather than static values and sliding observation windows, threshold adjustments to ensure consistently high recall. Additionally, we peeked at interpreting the model and shared our code online for reproducibility. We believe our model will help with identifying patients at risk of delirium early and will allow us to target preventive therapies, which is often time-consuming and personnelintensive, to the patients who are most likely to benefit.

FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Features ranked according to their importance in descending order in long short-term memory model in eICU. Color shows whether ranked variable value is high (red) or low (blue) for that observation.